EDA Project

The EDA project in this course has four main parts to it:
1. Project Proposal
2. Phase 1
3. Phase 2
4. Report

This notebook will be used for Project Proposal, Phase 1, and Phase 2. You will have specific questions to answer within this notebook for Project Proposal and Phase 1. You will also continue using this notebook for Phase 2. However, guidance and expectations can be found on Canvas for that assignment. The report is completed outside of this notebook (delivered as a PDF). Detailed instructions for that assignment are provided in Canvas.
Read this before proceeding:
1. Review the list of data sets and sources of data to avoid before choosing your data. This list is provided in the instructions for the Project Proposal assignment in Canvas.

2. It is expected that when you are asked questions requiring typed explanations you are to use a markdown cell to type your answers neatly. Do not provide typed answers to questions as extra comments within your code. Only provide comments within your code as you normally would, i.e. as needed to explain or remind yourself what each part of the code is doing.

Project Proposal

The intent of this assignment is for you to share your chosen data file(s) with your instructor and provide general information on your goals for the EDA project.
Step 1 (2 pts): Give a brief description of the source(s) of your data and include a direct link to your data.

https://github.com/BuzzFeedNews/2018-07-wildfire-trends/tree/master/data California fire reports, 1950 - 2017

https://www.macrotrends.net/states/california/population California Historical Population 1950 - 2020

Step 2 (2 pts): Briefly explain why you chose this data.

I am interested in the topic, and the data seems to come from a reliable source with a great amount of historical detail about California fires.

Step 3 (1 pt): Provide a brief overview of your goals for this project.

I thought it would be interesting to see how wildfires have gotten so bad in recent years. Climate change is of course a big factor, but I believe there are other reasons behind this. In this project I want to find the relationship between wildfires and population in California.

Step 4 (1 pt): Read the data into this notebook.
Step 5 (1 pt): Inspect the data using the info( ), head( ), and tail( ) methods.
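Steps 4 and 5 can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the in-memory CSV and its column names (`fire_name`, `year`, `acres`) are made-up placeholders standing in for the real fire file, whose schema may differ.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real data file. The column names and rows
# are illustrative assumptions, not the actual contents of the fire data.
SAMPLE_CSV = io.StringIO(
    "fire_name,year,acres\n"
    "EXAMPLE_A,1950,120.5\n"
    "EXAMPLE_B,1951,\n"
    "EXAMPLE_C,2017,3400.0\n"
)

def load_and_inspect(source, n=5):
    """Read a CSV into a DataFrame and print the three standard inspections."""
    df = pd.read_csv(source)
    df.info()            # column names, dtypes, and non-null counts
    print(df.head(n))    # first n rows
    print(df.tail(n))    # last n rows
    return df

fires = load_and_inspect(SAMPLE_CSV, n=2)
```

With a real file, `source` would simply be the path to the CSV; `pd.read_csv` accepts either a path or a file-like object.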
STOP HERE for your Project Proposal assignment. Submit your (1) original data file(s) along with (2) the completed notebook up to this point, and (3) the html file for grading and approval.
Instructor Feedback and Approval (3 pts): Your instructor will provide feedback in either the cell below this or via Canvas. You can expect one of the following point values for this portion.
3 pts - if your project goals and data set are both approved.
2 pts - if your data set is approved but changes to your project goals (Step 3) are needed.
1 pt - if your project goals are approved but your data set is not approved.
0 pts - if neither your data set nor your project goals are approved.

As needed, follow your instructor's feedback and guidance to get on track for the remaining portions of the EDA project.

EDA Phase 1

The overall goal of this assignment is to take all necessary steps to inspect the quality of your data and prepare the data according to your needs. For information and resources on the process of Exploratory Data Analysis (EDA), you should explore the EDA Project Resources Module in Canvas. Once you’ve read through the information provided in that module and have a comfortable understanding of EDA using Python, complete steps 6 through 10 listed below to satisfy the requirements for your EDA Phase 1 assignment. **Remember to convert code cells provided to markdown cells for any typed responses to questions.**
Step 6 (2 pts): Begin by elaborating in more detail from the previous assignment on why you chose this data.
1. Explain what you hope to learn from this data.
2. Do you have a hunch about what this data will reveal? (The answer to this question will be used in the Introduction section of your EDA report.)

1. Because wildfire cases in California are increasing, I want to understand the relationship between wildfires and population growth.
2. I believe there is a positive correlation between the two.

Step 7 (2 pts): Discuss the population and the sample:
1. What is the population being represented by the data you’ve chosen?
2. What is the total sample size?

1. The population is the wildfire reports filed in the state of California from 1950 to 2017.
2. According to the Fire and Resource Assessment Program (FRAP): "The data covered the period 1950 to 2001 and included USFS wildland fires 10 acres and greater, and CAL FIRE fires 300 acres and greater. BLM and NPS joined the effort in 2002, collecting fires 10 acres and greater. Also in 2002, CAL FIRE’s criteria expanded to include timber fires 10 acres and greater in size, brush fires 50 acres and greater in size, grass fires 300 acres and greater in size, wildland fires destroying three or more structures, and wildland fires causing $300,000 or more in damage. As of 2014, the monetary requirement was dropped and the damage requirement is 3 or more habitable structures or commercial structures."

Step 8 (2 pts): Describe how the data was collected. For example, is this a random sample? Are sampling weights used with the data?

The data was collected from reported fires, using a combination of ground-based and satellite-based data.

Step 9 (4 pts): In the Project Proposal assignment you used the info( ) method to inspect the variables, their data types, and the number of non-null values. Using that information as a guide, provide definitions of each of your variables and their corresponding data types, i.e. a data dictionary. Also indicate which variables will be used for your purposes.
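One way to start the data dictionary for Step 9 is to build a small table from the DataFrame itself and fill in the definitions by hand. The sketch below is only an illustration: the toy frame, its column names, the definitions, and the `used` set are all assumptions, not the real variables of the fire data.

```python
import pandas as pd

# Toy frame with assumed column names; replace with the real DataFrame.
df = pd.DataFrame({
    "fire_name": ["EXAMPLE_A", "EXAMPLE_B"],
    "year": [1950, 2017],
    "acres": [120.5, None],
})

# Hand-written definitions (hypothetical wording for hypothetical columns).
definitions = {
    "fire_name": "Name assigned to the fire",
    "year": "Year the fire started",
    "acres": "Area burned, in acres",
}
used = {"year", "acres"}  # columns assumed to be kept for the analysis

# Every piece is a Series indexed by column name, so rows align by label.
data_dictionary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "non_null": df.notna().sum(),
    "definition": pd.Series(definitions),
    "used": pd.Series({col: col in used for col in df.columns}),
})
print(data_dictionary)
```

The `dtype` and `non_null` columns repeat what `info( )` reports, which makes it easy to check the typed definitions against the data.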
Step 10 (10 pts): For full credit in this problem you'll want to take all necessary steps to report on the quality of the data and clean the data accordingly. Some things to consider while doing this are listed below. Depending on your data and goals, there may be additional steps needed beyond those listed here.
1. Are there rows with missing or inconsistent values? If so, eliminate those rows from your data where appropriate.
2. Are there any outliers or duplicate rows? If so, eliminate those rows from your data where appropriate. At each stage of cleaning the data, state how many rows were eliminated.
3. Are you using all columns (variables) in the data? If not, are you eliminating those columns?
4. Consider some type of visual display such as a boxplot to determine any outliers. Do any outliers need to be removed? If so, how many were removed?

It is good practice to get the shape of the data before and after each step in cleaning the data and add typed explanations (in separate markdown cells) of the steps taken to clean the data.
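The cleaning steps in Step 10 can be sketched as below, with the number of rows removed reported at each stage. The toy data and column names are made up, and the 1.5 × IQR rule is just one common outlier convention, chosen here as an assumption. For fire sizes, very large values may be genuine events rather than errors, so whether to drop them is a judgment call.

```python
import pandas as pd

def clean_with_counts(df, value_col):
    """Drop missing values, duplicate rows, and IQR outliers, counting removals."""
    removed = {}

    before = len(df)
    df = df.dropna(subset=[value_col])          # rows missing the key value
    removed["missing"] = before - len(df)

    before = len(df)
    df = df.drop_duplicates()                   # exact duplicate rows
    removed["duplicates"] = before - len(df)

    before = len(df)
    q1, q3 = df[value_col].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = df[value_col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    df = df[in_range]                           # keep values inside the IQR fences
    removed["outliers"] = before - len(df)

    for step, count in removed.items():
        print(f"{step}: {count} row(s) removed")
    print("final shape:", df.shape)
    return df, removed

# Made-up rows for illustration only.
raw = pd.DataFrame({
    "year":  [1950, 1951, 1951, 1952, 1953, 1954, 1955, 1956],
    "acres": [100, 120, 120, None, 110, 130, 90, 10000],
})
cleaned, removed = clean_with_counts(raw, "acres")
```

A boxplot of the same column (for example `raw.boxplot(column="acres")`, which requires matplotlib) uses the same 1.5 × IQR fences for its whiskers by default, so the points it draws individually are the rows this rule would drop.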
Include the rest of your work below and insert cells where needed.
STOP HERE for your EDA Phase 1 assignment. Submit your cleaned data file along with the completed notebook up to this point for grading.

EDA Phase 2

All of your work for the EDA Phase 2 assignment will begin below here. Refer to the detailed instructions and expectations for this assignment in Canvas.

Comments

The graphs above show an overall positive correlation between the number of fires and the population of California. Both have risen sharply over the period, which is worth investigating further.
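The visual impression can be backed by a number. A minimal sketch, assuming one row per fire with a `year` column and a separate population table with `year` and `population` columns (all names and values here are made up for illustration), computes the Pearson correlation between yearly fire counts and population:

```python
import pandas as pd

# Made-up toy data; one row per reported fire.
fires = pd.DataFrame({"year": [1950, 1950, 1951, 1952, 1952, 1952]})
population = pd.DataFrame({
    "year": [1950, 1951, 1952],
    "population": [100, 200, 300],   # placeholder values, not real figures
})

# Count fires per year, then line the counts up with population by year.
fires_per_year = (
    fires.groupby("year").size().rename("fire_count").reset_index()
)
merged = fires_per_year.merge(population, on="year", how="inner")

# Pearson correlation coefficient (the default for Series.corr).
r = merged["fire_count"].corr(merged["population"])
print(merged)
print("Pearson r:", r)
```

Because both series trend upward over time, a high coefficient here would describe association only; it does not by itself show that population growth causes more fires.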

Comments

Based on the cause descriptions, I grouped the causes into three categories: Human, Natural, and Unknown. As the chart shows, a large portion of the fires between 1950 and 2017 were caused by humans.
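The grouping described above can be sketched with a lookup table. The cause labels in the mapping are hypothetical examples, not the actual cause descriptions in the data, and anything missing from the mapping is assumed to fall under Unknown:

```python
import pandas as pd

# Hypothetical cause labels mapped to the three categories.
CAUSE_TO_CATEGORY = {
    "Lightning": "Natural",
    "Arson": "Human",
    "Campfire": "Human",
    "Equipment Use": "Human",
    "Unknown/Unidentified": "Unknown",
}

def categorize(cause):
    """Return Human, Natural, or Unknown; unmapped or missing causes are Unknown."""
    return CAUSE_TO_CATEGORY.get(cause, "Unknown")

# Made-up rows for illustration only.
fires = pd.DataFrame({
    "cause": ["Lightning", "Arson", "Campfire", None,
              "Unknown/Unidentified", "Arson"],
})
fires["cause_category"] = fires["cause"].map(categorize)

counts = fires["cause_category"].value_counts()
print(counts)
```

Keeping the mapping in one dictionary makes the categorization easy to review, since every judgment about what counts as human-caused is visible in one place.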

Summary

The heat maps show that the fire season each year runs from May to October. Natural fires occur consistently within the fire season. Human-caused fires, on the other hand, appear to happen more often and have spread beyond the fire season over the years.
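The table behind a year-by-month heat map can be built as in the sketch below. The `alarm_date` column name and the dates are assumptions made for illustration. The share of fires outside May to October is one simple way to quantify how far activity extends past the fire season.

```python
import pandas as pd

# Made-up start dates; the column name alarm_date is an assumption.
fires = pd.DataFrame({
    "alarm_date": pd.to_datetime([
        "1950-06-01", "1950-07-15", "1950-07-20",
        "2017-10-09", "2017-12-04",
    ]),
})
fires["year"] = fires["alarm_date"].dt.year
fires["month"] = fires["alarm_date"].dt.month

# Rows are years, columns are months 1..12, cells are fire counts.
counts = pd.crosstab(fires["year"], fires["month"])
counts = counts.reindex(columns=range(1, 13), fill_value=0)

# Share of each year's fires that fall outside May-October.
in_season = counts.loc[:, 5:10].sum(axis=1)
off_season_share = 1 - in_season / counts.sum(axis=1)
print(counts)
print(off_season_share)
```

Filtering `fires` by cause category before the `crosstab` call would give one such table per category, and a plotting function such as `seaborn.heatmap(counts)` can render each table if seaborn is installed.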

The "Top 100 California Fires(1950 - 2017) Count by Year" chart also shows that larger fires have become more common in recent years.
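A count of the largest fires by year can be produced as sketched here. The function name, the `acres` and `year` column names, and the toy rows are all assumptions for illustration; the real chart may have been built differently.

```python
import pandas as pd

def top_fires_by_year(df, n=100, size_col="acres", year_col="year"):
    """Take the n largest fires by size and count how many fall in each year."""
    top = df.nlargest(n, size_col)
    return top[year_col].value_counts().sort_index()

# Made-up rows for illustration only.
fires = pd.DataFrame({
    "year":  [1950, 1980, 2008, 2017, 2017],
    "acres": [500, 200, 900, 1500, 1200],
})
top3 = top_fires_by_year(fires, n=3)
print(top3)
```

With the full data, `n=100` would give the per-year counts behind a top-100 chart, and years with no fire in the top 100 simply do not appear in the result.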

The first chart shows how significantly the population of California has grown over the period. Although natural fires are becoming more frequent, the number of reported human-caused fires has grown much more sharply. In conclusion, the data shows a positive correlation between the population and the number of fires reported in California.